ABSTRACT



Paper documents can be converted into digital form as a
collection of images or as a combination of ASCII text and
images. Full text and image document databases each present
advantages and disadvantages during the scanning and
conversion processes. Conversion of paper thesis documents
could be eliminated if thesis documents were submitted in
digital form for storage on optical disks.

Utilizing existing paper thesis documents, image and
full text databases were developed and evaluated to
determine the best digital form for storage of paper
documents. A thesis document in digital form was analyzed
to determine the most feasible format for digital document
submission.

This thesis concludes that conversion of paper documents 
to digital form should not be pursued. Instead, thesis 
documents should be submitted in digital form for direct 
conversion and storage on optical disks. Follow-on thesis
research is recommended to build an in-house CD-ROM
mastering system for this purpose.

I. INTRODUCTION

A. DISCUSSION 

The storage of thesis documents at the Knox Library
aboard the Naval Postgraduate School requires a great deal
of space. These documents can be converted to digital form
for storage on optical disks, using optical scanners and
optical disk publishing software. Documents can be
digitized and stored in one of two fundamental forms: as
bit mapped images, or as full text documents with images of
graphics. Each form has advantages and disadvantages with
respect to document conversion and retrieval. A CD-ROM
optical disk will hold approximately 1200 thesis documents
(135,000 to 207,500 pages). Since information stored on
optical disks is retrievable only with a computer system,
it is vitally important that the different topical areas
within this information be accessible without a large
expenditure of time or effort.

The advent of word processors means that thesis documents
exist in digital form during their formulation and
completion. It would be advantageous to receive completed
thesis documents in digital form for direct storage on
optical disks.

This thesis analyzes the advantages and disadvantages of
storage and retrieval of thesis documents in both image and
full text form. It also analyzes the benefits and drawbacks
of submitting thesis documents in electronic form.

B. SCOPE 

This thesis includes an in-depth analysis of an image
database and a full text database. The purpose of this
analysis is to determine accurately the advantages and
disadvantages of each digitized form for conversion and
storage of existing documents on optical disks.

Analysis of requirements for submission of completed 
thesis documents in electronic form is also conducted, and a 
recommendation for possible implementation is made. 

C. METHODOLOGY 

Utilizing thesis documents presently stored at Knox
Library, an image database and a full text database were
developed by optically scanning documents for image
generation and optical character recognition. Optical disk
publishing software was used to link images of graphics
with the full text of documents, and to perform markup and
indexing of documents for searching and retrieval. Each
database was evaluated to determine which provided the
greatest benefits to the user (searcher).

Utilizing a digital copy of a completed thesis (text with
separate images of graphics), an experiment was conducted to
determine the requirements for turning in thesis documents
in electronic form. This experiment focused on the
advantages and disadvantages of a digital document in word
processor format (WordPerfect 5.1) versus a document in
ASCII text format.

II. OPTICAL TECHNOLOGY



A. INTRODUCTION 

The generation of a digital database for this thesis,
and the retrieval of that information, primarily involve
two areas of optical technology: optical scanners and
Compact Disk Read Only Memory (CD-ROM). This section
discusses, in general terms, the technologies associated
with scanners: their basic operation, optical character
recognition (OCR), and image scanning. It then discusses
the basic operation, structure, advantages, and
disadvantages of CD-ROM.

B. OPTICAL SCANNERS 

Optical scanners convert hard copy documents into a 
digital format. This is accomplished by converting the 
document into a bit mapped image or into text using OCR. 

The following sections will describe in general: the basic 
theory of how a scanner works, the different characteristics 
of a scanned image, and optical character recognition. 

1. Scanner Technology 

There are three basic types of optical scanners:
moving paper scanners, flat bed scanners, and electronic
digitizing camera scanners. The primary difference among
these scanning technologies is the method of document
illumination (or document transport) during the scanning
process. (Fowler and Clipper, 1991, pp. 44-46)

The scanning process is accomplished by five components:
a group of charge-coupled devices (CCDs); a light source; a
set of optics to focus the light image onto the CCDs;
electronics to distinguish between dark (inked) areas and
light (white) areas; and a transport mechanism to move
either the paper or the scanning device, depending on the
type of scanner. [1] (Wageman, 1989, pp. 3.112.4-3.112.6)
Figure 1 portrays the interaction of components during the
scanning process.




Figure 1 (Taylor, 1987, p. 10)



[1] A charge-coupled device is a light-sensitive
semiconductor that produces an electrical signal
proportional to the light incident upon it.

During the process of scanning, a document is
illuminated a strip at a time by a light source. The
movement of the document relative to the light source is
accomplished by the transport mechanism, which either moves
the scanning device (light source) across the document or
moves the document across the scanning surface. The
reflected light is directed by a set of optics onto a
series of CCDs, which transform the optical signal into an
electrical signal. The electronics and the scanner software
then convert the electrical signal into binary ones and
zeros, which represent the scanned image.

2. Scanning Images 

Unlike text in ASCII format, an image is not 
directly accessible. It is a two dimensional bit mapped 
representation. There are three characteristics used to 
discuss image scanning: resolution, levels of greyscale, and 
color. These characteristics of image scanning are 
discussed in the following sections. 
a. Resolution 

Resolution is a measure of the level of detail
represented in an image. Resolution is typically discussed
in terms of pixels, or dots per inch (dpi). Pixel
originates from a television industry term, "picture
element." [2] Dpi refers to the number of pixel dots per
linear inch used to represent the image. A resolution of
400 dpi implies that every square inch of the image is
represented by 160,000 pixel dots (400 x 400). An image
with a resolution of 400 dpi contains more detail, and thus
has better quality, than an image with a resolution of 200
dpi.
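
The resolution arithmetic above can be checked with a short
sketch. The function names and the 8.5 x 11 inch page size are
assumptions introduced for illustration; only the 400 dpi and
200 dpi figures come from the text.

```python
def pixels_per_square_inch(dpi: int) -> int:
    """Number of pixel dots representing one square inch at the given dpi."""
    return dpi * dpi


def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Total pixels for one scanned page (page size is an assumed default)."""
    return round(width_in * dpi) * round(height_in * dpi)


print(pixels_per_square_inch(400))  # 160000
print(pixels_per_square_inch(200))  # 40000
print(page_pixels(400))             # 14960000
```

Doubling the resolution from 200 to 400 dpi quadruples the
number of pixels that must be stored for the same area.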

Image resolution is often confused with display
resolution. Image resolution refers to the resolution of
the stored image, whereas display resolution refers to the
number of pixels the screen can display. An image with a
resolution less than that of the display will take up only
a portion of the display, whereas an image with a
resolution greater than that of the display will cover an
area greater than the display. (Apperson and Doherty, 1987,
pp. 124-126)
b. Levels of Greyscale 

The number of bits stored per pixel determines
the amount of information that can be stored for each pixel.
The ability to store this additional information
distinguishes greyscale images from black and white images.
In a black and white image, each pixel is represented by one
bit, which corresponds to either black or white. In a
greyscale system, two or more bits are stored for each
pixel. The number of grey levels that can be represented by
each pixel is two raised to the power of the number of bits
stored per pixel (2^x for x bits). For example, in a four
bit per pixel system, each pixel could be one of 16 shades
of grey.

[2] The detail of a television picture image is
determined by a combination of picture elements per
horizontal scan line, and the number of horizontal scan
lines per picture. In the United States a picture contains
525 scan lines with 435 picture elements per scan line.
This equates to a resolution of 435 x 525. (Fink, 1984, pp.
83-84)

c. Digitizing Color Images 

The representation of color images is very 
similar to that of greyscale images. Most color scanners on 
the market today produce color images by scanning a document 
three times in succession; once for each of the primary 
colors: red, green, and blue. Each of the three scans uses 
as its light source, one of three mercury vapor lamps which 
produce red, green, or blue light. Each of the three scans 
produces a color bit map, which is later merged to produce 
the color image. The number of bits stored for each of the
primary colors determines the number of displayable colors.
Storing four bits per primary would allow 4096 colors
(2^4 x 2^4 x 2^4 = 4096). (Apperson and Doherty, 1987, pp.
143-146)
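
The bit-depth arithmetic used for greyscale and color images
can be sketched as follows. The function names are
illustrative assumptions; the four bit examples are the ones
given in the text.

```python
def grey_levels(bits_per_pixel: int) -> int:
    """Grey levels representable by one pixel: 2 raised to the bit count."""
    return 2 ** bits_per_pixel


def displayable_colors(bits_per_primary: int, primaries: int = 3) -> int:
    """Colors available when each primary (red, green, blue) has its own bits."""
    return grey_levels(bits_per_primary) ** primaries


print(grey_levels(1))         # 2
print(grey_levels(4))         # 16
print(displayable_colors(4))  # 4096
```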

3. Optical Character Recognition 

Optical Character Recognition (OCR), or Intelligent
Character Recognition (ICR), is the process by which the
text (alphanumeric characters) of a scanned image is
converted from a bit mapped representation into ASCII
characters. There are two primary methods used in this
process: matrix matching and feature matching.

a. Matrix Matching 

The process of matrix matching breaks the image
into a matrix of blocks, corresponding to character
locations. These blocks are subdivided into a matrix of
pixels. The pixel matrix of each character block is then
compared with those of the standard alphanumeric characters
until a match is found that satisfies a preprogrammed level
of confidence.
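
A minimal sketch of matrix matching follows, assuming a tiny
3 x 5 pixel character block and a hypothetical three-letter
template set. It scores each template by the fraction of
agreeing pixels and accepts the best one only if it meets the
confidence level, a simplification of the comparison described
above rather than any particular OCR product's method.

```python
from typing import Dict, List, Optional

Glyph = List[str]  # rows of '#' (inked pixel) and '.' (white pixel)

# Hypothetical template set; a real system stores the full character set.
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(block: Glyph, template: Glyph) -> float:
    """Fraction of pixels on which the block and the template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        b == t
        for block_row, template_row in zip(block, template)
        for b, t in zip(block_row, template_row)
    )
    return same / total


def recognize(block: Glyph, confidence: float = 0.9) -> Optional[str]:
    """Return the best matching character, or None if below the confidence level."""
    best_char, best_score = max(
        ((char, match_score(block, tmpl)) for char, tmpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_score >= confidence else None


noisy_t = ["###", ".#.", "##.", ".#.", ".#."]  # a "T" with one stray pixel
print(recognize(noisy_t))      # T
print(recognize(["..."] * 5))  # None
```

Raising the confidence level rejects more noisy characters;
lowering it accepts more of them at the risk of misreads.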

b. Feature Matching 

The process of feature matching is similar to 
matrix matching, as the character locations are broken down 
into a matrix of pixels. In feature matching, the 
comparison is not of a whole pixel pattern, but of the 
distinct features of the pattern. These features include 
vertical, horizontal and diagonal lines, as well as loops. 

An illustration of feature comparison is shown in Figure 2. 
The determination of a correct match is again based on a 
preprogrammed level of confidence. 

For a more detailed explanation of OCR see Taylor,
1987, pp. 12-17.

FEATURE ANALYSIS IN OCR TECHNOLOGY

Figure 2 Feature Matching (Wageman, 1989, p. 2.32.5)



C. COMPACT DISK READ ONLY MEMORY 
1. Basic Concepts 

Compact Disk Read Only Memory (CD-ROM) is an
internationally accepted mass storage medium. The "read
only" capability of CD-ROM makes it ideally suited for the
distribution of large amounts of unchanging data.

The worldwide acceptance of CD-ROM as a reliable
mass storage medium is a result of its standardization. The
physical and logical requirements of CD-ROM are established
by the specifications in International Standards
Organization (ISO) 9660, commonly referred to as "High
Sierra Group." This standardization ensures that a CD-ROM
disc manufactured in accordance with the ISO 9660 standard
can be played on any CD-ROM drive meeting the same standard.

2. CLV vs CAV 

A CD-ROM drive is a constant linear velocity (CLV)
device, rather than a constant angular velocity (CAV) device
like most magnetic drives. This means that the rotation
speed of the disc varies, depending on the location of the
read head. As the read head moves from the inner to the
outer edge of the disc, the speed of rotation decreases
to keep the disc area under the read head moving at a
constant velocity relative to the read head. This implies
that the density of the data written on the disc remains
constant throughout the entire disc.
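
The relationship between head position and rotation speed in a
CLV device can be sketched as follows. The linear velocity of
1.3 meters per second and the 25 mm and 58 mm radii are
illustrative assumptions, not values taken from this thesis.

```python
import math


def clv_rpm(radius_m: float, linear_velocity_m_s: float = 1.3) -> float:
    """Revolutions per minute needed to hold a constant linear velocity.

    One revolution covers a track length of 2 * pi * radius, so the
    required rotation rate falls as the radius grows.
    """
    return linear_velocity_m_s / (2 * math.pi * radius_m) * 60


inner = clv_rpm(0.025)  # assumed innermost data radius (25 mm)
outer = clv_rpm(0.058)  # assumed outermost data radius (58 mm)
print(f"inner edge: {inner:.0f} rpm, outer edge: {outer:.0f} rpm")
```

Under these assumptions the disc turns more than twice as fast
when the read head is at the inner edge as when it is at the
outer edge.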

3. Physical Structure 

The physical structure of a CD-ROM disc can be
characterized by two distinct features: microscopic
pits and lands, and a continuous spiral. As shown in Figure
3, data is represented by a series of microscopic lands and
pits. This data is read by measuring the reflected light
from a laser beam as it passes over the pattern of
microscopic pits and lands. As the laser passes from land
to pit, or from pit to land, the amount of reflected
light changes, representing a binary one.

Figure 3 Microscopic Structure of CD-ROM Disc
(Hoge, 1989, p. 54)

Pit: microscopic depression in the reflective surface of a
CD-ROM.
Binary 1: the transition from pit to land and land to pit.
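
The way transitions encode binary ones can be illustrated with
a simplified sketch. It assumes the surface is sampled at fixed
intervals, that an interval with no change in reflected light
reads as a binary zero, and it ignores the channel coding that
real drives apply; the function name is hypothetical.

```python
def decode_transitions(surface: str) -> str:
    """Convert a sampled surface ('P' = pit, 'L' = land) into bits.

    Each change between neighboring samples is a '1'; no change is a '0'.
    """
    return "".join(
        "1" if previous != current else "0"
        for previous, current in zip(surface, surface[1:])
    )


print(decode_transitions("LLLPPPPLLL"))  # 001000100
```

Note that it is the edges of a pit, not the pit itself, that
carry the binary ones.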



These microscopic pits are arranged in a continuous
spiral, as depicted in Figure 4. The density of this spiral
is 16,000 tracks per inch, providing a total storage length
of approximately three miles. The continuous spiral of a
CD-ROM disc is divided into 270,000 sectors that are
referenced in terms of minutes, seconds, and sector. [3]
There are 2352 bytes per sector, 2048 bytes of which are
usable. This equates to a usable storage capacity of
540,000 kilobytes per CD-ROM disc (270,000 sectors/disc x 2
kilobytes/sector). The utilization of bytes per sector is
shown in Table 1. (Fowler and Clipper, 1991, pp. 36-38)

[3] Each CD-ROM disc contains 60 minutes of play. Each
minute contains 60 seconds. There are 75 sectors per
second. 60 minutes/disc x 60 seconds/minute x 75
sectors/second = 270,000 sectors/disc.
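
The sector and capacity arithmetic above, together with the
byte counts listed in Table 1, can be checked with a short
sketch. The variable names are assumptions introduced for
illustration; all of the numbers come from the text.

```python
MINUTES_PER_DISC = 60
SECONDS_PER_MINUTE = 60
SECTORS_PER_SECOND = 75

# Byte utilization per sector, as listed in Table 1.
SECTOR_LAYOUT = {
    "synchronization": 12,
    "header": 4,
    "user data": 2048,
    "error detection": 4,
    "unused": 8,
    "error correction": 276,
}

sectors_per_disc = MINUTES_PER_DISC * SECONDS_PER_MINUTE * SECTORS_PER_SECOND
bytes_per_sector = sum(SECTOR_LAYOUT.values())
usable_kilobytes = sectors_per_disc * SECTOR_LAYOUT["user data"] // 1024

print(sectors_per_disc)  # 270000
print(bytes_per_sector)  # 2352
print(usable_kilobytes)  # 540000
```

Only 2048 of every 2352 bytes hold user data; the remainder is
spent on synchronization, addressing, and error handling.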




Figure 4 CLV vs CAV (Fowler and Clipper, 1991, p. 37)

Constant Angular Velocity (CAV): Concentric Tracks
Constant Linear Velocity (CLV): Spiral Track



TABLE 1. BYTE UTILIZATION PER SECTOR

Synchronization data        12 bytes
Header data                  4 bytes
User data                 2048 bytes
Error Detection data         4 bytes
Unused data                  8 bytes
Error Correction data      276 bytes
Total                     2352 bytes

4. Advantages and Disadvantages of CD-ROM

If thesis documents were available in digitized
form, what would be the advantages and disadvantages of
storing this information on CD-ROM?

1. Compact Storage: No other medium provides the storage
   capacity for such a vast amount of information in such
   a compact size.

2. Excellent File Integrity: The read only nature of
   CD-ROM alleviates the problem of inadvertent modification
   or erasure. The estimated 50 year life of CD-ROM
   ensures reliable file integrity when used for archival
   of data. (Arnold, July 1991, p. 40) There is no risk
   of head crashes resulting in lost data, like those
   possible on magnetic drives.

3. Extremely Cost Effective: The information storage cost
   using CD-ROM is 4 cents/MB, compared to 90 cents/MB
   using floppy disks, 50 cents/MB using WORM, and $4.00/MB
   using Winchester drives. The only medium currently
   cheaper than CD-ROM is Digital Paper at .006 cents/MB.
   The drawback of Digital Paper is that it is an emerging
   technology with very expensive drives. (Arnold, July
   1991, pp. 40-44)

4. Slow Access Times: The only real disadvantage of
   CD-ROM is its relatively slow access time compared to
   Winchester and floppy disk drives. The access time for
   a Winchester or floppy disk drive is more than 10 times
   faster than that of a CD-ROM drive.






III. INFORMATION RETRIEVAL 



A. INTRODUCTION 

The electronic revolution has enabled the storage of
vast amounts of information (in the form of documents) on
digital media. Information in electronic form, unlike
information on paper, can only be retrieved electronically
via a computer system. Consequently, our ability to
retrieve information from an electronic document database
depends on our ability to index the information properly,
so that it can be located effectively and efficiently by the
user. A database with well indexed documents provides for
effective retrieval of information, allowing the
tremendous storage and time saving benefits of electronic
information to be fully realized.

The following sections discuss the differences between
retrieving documents and retrieving data, the problems
associated with document retrieval, some measures of
retrieval effectiveness, browsing, and the different methods
of searching for documents.

B. DOCUMENT RETRIEVAL VS DATA RETRIEVAL



Retrieval of data is much different from retrieval of
documents. In a study conducted in 1984, David C. Blair
points out four fundamental ways in which document
retrieval differs from data retrieval:

• "the way queries are answered." 

• "the relation between the formal system request, and 
user satisfaction." 

• "the criterion for successful retrieval." 

• "the factors that influence speed." (Blair, 1984, pp. 
369-370).

These differences are discussed below: 

1. Response to a Query 

Data retrieval systems return the data that is the
direct answer to the question. For example, if asked for
Joe Smith's address, the system would return Joe Smith's
address (if it was in the database).

Document retrieval systems require that one or more
documents be located by matching key words, phrases, or key
fields contained within those documents. This implies that
the answer to a document retrieval query will not be the
data directly, but a list of documents likely to contain the
answer. The user must then review the documents to
determine if they contain the required information.

2. Question/Answer Relationship

A document search for key words or phrases provides
the user with a list of documents that have a high
probability of containing the required information. This is
based on the user's perception of which words or phrases
best match a subject area to a document search. This highly
probabilistic process depends greatly on the user's
prediction of how information is presented in the
documents. Conversely, data retrieval systems are based on
logical relationships between a question and an answer. The
system determines an answer based on a question. If asked
for the telephone number of Joe Smith, the system will
locate and provide the telephone number of Joe Smith; all
other answers would be incorrect.

3. Utility - The Criterion for Successful Retrieval

Data retrieval systems provide either an
appropriate answer or no data at all (i.e., the desired
information is not in the database). Success is based on
correctness. Document retrieval systems provide a set of
documents that must be reviewed to determine usefulness.
The measure of success is not correctness, but how useful
the retrieved documents are to the user, or utility.
Utility is a subjective measure that varies from user to
user.

4. Influence on Retrieval Speed

The deterministic nature of data retrieval
results in response times that are closely tied to system
speed. Document retrieval is an iterative process, heavily
dependent on the user's ability to predict accurately how
information is presented. An accurate prediction will
produce a query with enough specificity to isolate desired
information while excluding most unwanted information. A
less accurate prediction will result in a less precise
query, with less favorable results. Consequently, the
number of iterations (search/evaluate/query refinement)
required to isolate the desired information is an outcome
of query formulation, which directly affects overall
retrieval time. This results in a total retrieval time that
is heavily influenced by the user and less dependent on the
system. For this reason, a retrieval system that provides
both effective and efficient tools for formulating queries
will outperform one that does not, even if the latter runs
on a faster system. This is particularly important in
CD-ROM databases, whose access times are significantly
slower than those of traditional Winchester drives.

C. FULL TEXT DOCUMENT RETRIEVAL 

J. D. Fowler and B. Clipper divide electronic document
retrieval systems into three classes: database document
retrieval systems, full text retrieval systems, and hybrid
systems (Fowler and Clipper, 1991, p. 61). Our discussion
will be limited to full text retrieval systems. For more
information on database and hybrid document retrieval
systems, see Fowler and Clipper, 1991, pp. 61-67.

1. Index Formulation For Document Retrieval. 

Full text document retrieval systems save the entire
text of the documents. The retrieval system creates an
inverted index file of all the unique words (minus
stopwords) in the database. [4] The inverted index file
consists of several inter-related levels of indexes. The
first level is a dictionary of all unique words in the
document database. Each word in the dictionary has an
associated occurrence list of all documents, as well as the
location(s) within those documents, of every occurrence of
that word. When a search for a word or phrase is conducted,
the retrieval software searches the index in lieu of the
entire database, providing faster retrieval times.
(Fand, 1987, pp. 84-85)
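
A minimal sketch of such an inverted index follows. The
document identifiers, sample text, and function names are
hypothetical, the stopword set holds only the examples given
in the stopword footnote, and the single in-memory dictionary
stands in for the multi-level index file a real retrieval
package would build.

```python
import re
from collections import defaultdict

# Stopwords are not indexed (examples taken from the stopword footnote).
STOPWORDS = {"an", "by", "can", "do", "for", "have", "if", "when"}


def build_index(documents):
    """Map each unique word to {document id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word not in STOPWORDS:
                index[word][doc_id].append(position)
    return index


def search(index, word):
    """Look the word up in the index instead of scanning every document."""
    return sorted(index.get(word.lower(), {}))


docs = {
    "thesis-1": "Optical scanners convert documents for storage",
    "thesis-2": "CD-ROM storage of scanned documents",
}
index = build_index(docs)
print(search(index, "storage"))   # ['thesis-1', 'thesis-2']
print(dict(index["documents"]))   # {'thesis-1': [3], 'thesis-2': [5]}
print(search(index, "for"))       # []
```

The occurrence list for each word records both the documents
and the positions within them, which is what allows phrase and
proximity searches to be answered from the index alone.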

[4] Stopwords are words that are not helpful in the
retrieval of a document. Some examples are: an, by, can,
do, for, have, if, and when. These words are not indexed,
saving space and speeding up the retrieval process.

The storage requirements associated with indexes of
this nature range from 30 to 100 percent of the total size
of the document database (Fand, 1987, p. 95). Therefore,
the document text that an actual CD-ROM database can hold is
considerably smaller than the 540 megabyte storage space.
For example, the size of the database could range from 415
to 270 megabytes, which is equivalent to 207,500 to 135,000
pages of text, depending on the size of the index.
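
The figures above follow from dividing the disc capacity
between text and index. The sketch below reproduces them; the
500 pages per megabyte conversion is not stated in the text but
is the ratio implied by its page counts (415 megabytes for
207,500 pages), and the function name is an assumption.

```python
DISC_CAPACITY_MB = 540
PAGES_PER_MB = 500  # implied by the text: 207,500 pages / 415 megabytes


def text_capacity_mb(index_overhead: float) -> int:
    """Whole megabytes left for document text.

    index_overhead is the index size as a fraction of the text size,
    so text + overhead * text must fit within the disc capacity.
    """
    return int(DISC_CAPACITY_MB / (1 + index_overhead))


for overhead in (0.30, 1.00):
    megabytes = text_capacity_mb(overhead)
    pages = megabytes * PAGES_PER_MB
    print(f"index at {overhead:.0%} of text: {megabytes} MB, {pages:,} pages")
```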

2. Difficulties Associated with Document Retrieval 

When searching a document database the user is faced
with the task of formulating a query that will satisfy
prediction and futility point criteria (Blair, 1984, p.
371).

The prediction criterion requires that the user
accurately predict how the information they are interested
in is represented, in order to formulate a query that will
best retrieve that information. The searcher must formulate
this query in a manner that is exhaustive enough to
retrieve a satisfactory number of relevant documents, yet
specific enough to exclude irrelevant documents.

The futility point criterion states that, in the
process of document searching, the user will continue to
refine his/her search until the set of documents retrieved
is small enough that the searcher is willing to
browse through them. The futility point for most searchers
is from 20 to 50 documents (Blair, 1984, p. 372). Research
conducted by McCarn and Lewis in 1990 states that the
optimum value of information as related to retrieval size
is 35 documents (McCarn and Lewis, 1990, p. 498). The user
will therefore maximize the value of the information per
retrieval set if he/she limits the retrieved set to
approximately 35 documents.

A user's futility point is not a function of
database size. As the size of a database grows, the
occurrence lists of the unique terms in the database also
grow. Therefore, in general, a search (using the same
terms) of a large database will retrieve more documents than
an equivalent search on a small database. Consequently, as
the size of a document database increases, so will the
number of refinements (or search iterations) necessary to
obtain a set of documents less than or equal to the user's
futility point.

Search refinements or iterations usually take the
form of search term modification, such as adding or
subtracting terms in combination with boolean operators
(AND, OR, and NOT). This is generally very effective. One
should use caution when refining a search query, however,
as the addition of more search terms will not always
increase the probability of finding the desired
information. As the number of search terms increases, the
number of different searchable combinations increases. Only
if the correct combination(s) of these terms is chosen will
the searcher achieve better results. As an example of this
highly probabilistic process, assume that the information
desired by the user is the shaded portion of Figure 5. The
shaded portion of Figure 5 represents the intersection of
the documents that contain term A with those that contain
term C, that is, all the documents that contain both terms
A and C.




Figure 5 legend:

A: Documents retrieved with term A.
B: Documents retrieved with term B.
C: Documents retrieved with term C.
D: Documents retrieved with term D.

ASSUMPTIONS:
A, B, C, D > Futility Point
All set intersections < Futility Point

Suppose the user commenced a search with term A,
resulting in set A. Since the number of documents retrieved
is greater than the user's futility point, the query is
modified to include terms A AND B. Results of the query are
shown in Figure 6.

Figure 6 Intersection of A & B

Since the number of documents resulting from the
query (terms A AND B) is below the user's futility point, the
search could stop here, and the desired documents would not
be retrieved. Additional refinement of the query by adding
term D (A AND B AND D) will result in a smaller number of
retrieved documents, but will not retrieve the documents
desired, as illustrated in Figure 7.

Abandoning term B and picking a new term C (A AND C AND D)
will result in a satisfactory query containing a portion of
the relevant documents, as shown in Figure 8. Only by
combining terms A AND C and abandoning terms B and D can the
searcher retrieve all desired documents, as previously
shown in Figure 5.

Figure 7 Terms A AND B AND D
Figure 8 A AND C AND D

Understanding the capabilities as well as the
potential pitfalls of query refinement through term
enhancement is increasingly important as the size of
document databases increases.
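
The refinement sequence in this example can be reproduced with
set operations, where AND corresponds to set intersection. The
document numbers and the futility point of 5 are made-up
values, chosen only so that each single-term set exceeds the
futility point and every intersection falls below it.

```python
FUTILITY_POINT = 5  # assumed value for this toy example

# Hypothetical document numbers retrieved by each single term.
A = set(range(1, 11))
B = {1, 2, 3, 31, 32, 33, 34}
C = {7, 8, 9, 10, 21, 22, 23}
D = {3, 9, 10, 41, 42, 43, 44}

desired = A & C  # the documents the user actually wants


def report(label, result):
    """Show the size of a retrieval set and how much of the target it holds."""
    size = "browsable" if len(result) <= FUTILITY_POINT else "too large"
    found = len(result & desired)
    print(f"{label:<14}{len(result):>3} retrieved ({size}), "
          f"{found} of {len(desired)} desired")


report("A", A)
report("A AND B", A & B)
report("A AND B AND D", A & B & D)
report("A AND C AND D", A & C & D)
report("A AND C", A & C)
```

In this toy data the query A AND B is small enough to browse
yet contains none of the desired documents, which is the
pitfall the example warns about.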

D. MEASURES OF DOCUMENT RETRIEVAL EFFECTIVENESS 

Document retrieval is indirect, probabilistic, and time
consuming, and its success is measured by utility. These
characteristics constrain the measurement of document
retrieval effectiveness. No measures have been found that
diminish the effects of subjectivity on document retrieval;
however, accepted measures are used with these effects
taken into account.

The two most widely used measures of document retrieval
effectiveness are precision and recall (Blair and Maron,
1985, p. 290). Precision ratio, the fraction of retrieved
documents which are relevant, characterizes the system's
capability for retrieving only relevant documents. Recall
ratio, the fraction of relevant documents retrieved,
describes how good the system is at retrieving all of the
documents that are relevant. The mathematical formulas for
precision and recall are shown in Table 2. (McCarn and
Lewis, 1990, p. 496)

TABLE 2. PRECISION & RECALL RATIOS



Precision Ratio: PR = r/S 

Recall Ratio: RR = r/R 

R: total number of relevant documents in the database. 

S: number of documents retrieved by the search. 

r: number of documents relevant, of those retrieved. 

(Notation adopted from McCarn & Lewis, 1990, p. 496) 
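
The two ratios in Table 2 can be computed directly from sets of
document numbers, as sketched below. The document numbers are
made up for illustration, and returning zero for an empty set
is an assumed convention, not part of the cited definition.

```python
def precision_ratio(retrieved: set, relevant: set) -> float:
    """PR = r / S: the fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention when nothing is retrieved
    return len(retrieved & relevant) / len(retrieved)


def recall_ratio(retrieved: set, relevant: set) -> float:
    """RR = r / R: the fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


relevant = {1, 2, 3, 4}             # R = 4
general_search = set(range(1, 11))  # S = 10, r = 4
specific_search = {1, 2, 3, 11}     # S = 4,  r = 3

print(precision_ratio(general_search, relevant))   # 0.4
print(recall_ratio(general_search, relevant))      # 1.0
print(precision_ratio(specific_search, relevant))  # 0.75
print(recall_ratio(specific_search, relevant))     # 0.75
```

In this made-up case the more specific search raises precision
from 0.4 to 0.75 while lowering recall from 1.0 to 0.75.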

Retrieval systems operate at various levels of recall
and precision, by varying the specificity of the submitted
query. The average performance for a query will move from
low-precision/high-recall to high-precision/low-recall as
the query is made more specific (Salton, 1986, p. 651).
This relationship is shown in Figure 9.

For example, consider a search for documents pertaining
to "micro economics". A general search using the term
"economics" will retrieve all documents containing
"economics." Many of the documents that contain the term
"economics" and not "micro economics" are probably
irrelevant (low-precision); however, those that are relevant
would be found (high-recall). This would place the searcher
at the low-precision/high-recall end in Figure 9. Making
the search more specific by searching for "micro economics"
will increase precision, as many of the irrelevant documents
(those containing "economics" and not "micro economics")
will not be retrieved. Those documents that were relevant
to "micro economics" and only contained the term "economics"
will not be retrieved, which will lower recall. This moves
the searcher in the direction of high-precision/low-recall.
[5]

Figure 9 Precision - Recall Relationship. (Salton, 1970,
p. 336)



[5] Since S_s << S_g and r_s < r_g (as some documents that
contain only economics may be relevant to micro economics),
PR_s > PR_g, and since R is constant, RR_s < RR_g. The
subscripts s and g denote the specific and general searches.

Traversing from low-precision/high-recall to
high-precision/low-recall is the process of satisfying the
user's futility point. Given that the user's initial query
results in a retrieval set greater than his/her futility
point, he/she will refine the query (i.e., make it more
specific) to reduce the number of documents retrieved.
Proper refinement of a general query will result in a
retrieval set of smaller size, with a higher percentage of
relevant documents, and non-retrieval of some relevant
documents (Blair and Maron, 1985, pp. 296-297). The result
is moving from low-precision/high-recall to
high-precision/low-recall.

Precision and recall cannot be used to compare
databases of substantially different size and topical
specialty. As the size of a database grows, recall will
automatically decrease because of increases in the number of
refinements necessary to achieve a retrieval set below the
user's futility point (Blair and Maron, 1985, pp. 296-297).
Precision cannot be used, as it is unclear whether the
changes in size and topical specialty will affect irrelevant
retrievals more strongly than relevant retrievals (Doyle,
1975, p. 357).

A point of much controversy in using precision and
recall ratios as measures of retrieval effectiveness is
that they are based on relevance. It is argued that
relevance is not really a measure because it is not
additive. For example, if documents were weighted on a
scale from zero to ten based on their relevance, would two
documents with weights of five and five be equivalent to
two documents with weights of eight and two? This
controversy can be avoided if one strictly states that a
document is either relevant or irrelevant, and dispenses
with determining the degree of relevance. M. E. Lesk and
G. Salton define four variables that can affect a relevance
judgement (Lesk and Salton, 1971, pp. 507-508). These four
variables are listed below and are illustrated in Figure 10.

Relevance is affected by: 

• "the type of document being judged - including 
its subject matter, level of difficulty, level of 
condensation, style." 

• "the conditions under which judgements must be 
rendered, that is the time available, the order 
of presentation and the size of the document set, 
the type of task specification." 

• "the statement specifying the information 
requirement which determines relevance." 

• "the type of judge used to render the judgements, 
that is his experience, background, attitude and 
so on."



A comparison of the two databases developed for this
thesis was made by controlling these variables. Each
database contained the same documents; although their
representations were different, the difference did not have
any bearing on their relevance. The conditions under which
the judgements were made were controlled so as to maintain
uniformity, the same judge made every determination, and
the requirement of relevance was identical for each
database.
Figure 10 Variables of Relevance Judgements. (Lesk and Salton, p. 508, 1971)

E. SEARCHING FEATURES OF FULL TEXT DOCUMENT RETRIEVAL 

A variety of methods may be used when performing a
document search, and each will provide different results.
The searching methods that a software package provides will
significantly affect its usefulness, and should be
carefully considered during its evaluation.

The most notable searching methods are discussed below:

1. Word Searching 

Searching for a word is the most elementary form of
text searching. When you perform a word search, the
retrieval system will locate all the occurrences of that
word in its document database. Single word searching,
however, is quite general, and will typically result in very
large sets of retrieved documents. For example, if you were
interested in color scanners and you performed a word
search using "scanner(s)", chances are you would locate all
sorts of documents which discuss scanners, but only a few
might discuss color scanners. Locating the documents you
desire would then require browsing the retrieved document
set, which could be very time consuming.
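A word search of this kind can be sketched with a simple inverted index. This is only an illustration of the idea, not the method used by any particular retrieval package; the three-document database and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical three-document database, for illustration only.
documents = {
    1: "The flatbed scanner digitizes each page as a bit mapped image.",
    2: "A color scanner records red, green and blue values.",
    3: "Optical disks store large document databases.",
}


def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def word_search(index, word):
    """Return the sorted ids of every document containing the word."""
    return sorted(index.get(word.lower(), set()))


index = build_index(documents)
hits = word_search(index, "scanner")
print(hits)
```

Both scanner documents are returned, although only one of them concerns color scanners, so the searcher would still have to browse the retrieval set.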

2. Phrase Searching

Phrase searching is the capability of searching for 
phrases consisting of consecutive words within a document. 
This capability can reduce the number of extraneous
documents retrieved. Searchable phrase lengths range from 
approximately two to four consecutive words, depending on 
the capability of the retrieval software package. 
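The difference from word searching can be illustrated with a minimal sketch that accepts a document only when the words of the phrase occur consecutively. The two sample documents and the helper names are hypothetical, and real packages would work from an index rather than scanning the text.

```python
import re


def tokenize(text):
    """Split text into lower-cased words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively in the text."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# Hypothetical documents: both contain "color" and "scanner".
documents = {
    1: "The color scanner was tested first.",
    2: "The scanner cannot reproduce color.",
}

phrase_hits = [doc_id for doc_id, text in documents.items()
               if contains_phrase(text, "color scanner")]
print(phrase_hits)
```

Only the document in which the two words are adjacent and in order is retrieved, which is how phrase searching trims extraneous documents.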

3. Proximity Searching 

Proximity searching is a mixture of word and phrase
searching with greater capabilities. It allows the searcher
to specify words or phrases that must occur within a given
proximity of each other. The proximity can be measured in
words, lines, paragraphs, or documents, depending on the
capabilities of the retrieval software package. For
example, to find information on "how a color scanner works"
you might specify a proximity search of "color scanner(s)"
in the same paragraph as "operation". As with all full text
searching, it is important to think of how the document may
have been written, and pick terms accordingly, instead of
paraphrasing ideas or topics.
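As a rough sketch of the idea, the following measures proximity in words, one of the units mentioned above. The function names, the sample sentences, and the distance of five words are all assumptions made for illustration; paragraph-level proximity would work the same way with paragraphs in place of word positions.

```python
import re


def positions(words, term):
    """Return every position at which the term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]


def within(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur within max_distance words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(abs(i - j) <= max_distance
               for i in positions(words, term_a)
               for j in positions(words, term_b))


# Hypothetical sample texts.
near = "The operation of a color scanner depends on three light sources."
far = ("Scanner prices fell last year. Unrelated chapters follow before "
       "the report turns to library operation.")

print(within(near, "scanner", "operation", 5))
print(within(far, "scanner", "operation", 5))
```

Both texts contain both terms, so a Boolean AND would retrieve each of them; the proximity test keeps only the one in which the terms are close together.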

4. Boolean Operators 

Boolean operators provide the searcher with a powerful
tool for formulating and refining queries. Most retrieval
programs provide three Boolean operators: AND, OR, and
AND NOT. These operators can be used in combination with
words, phrases and sets.6 The Boolean operators AND and
AND NOT generally shrink the size of the retrieval set, as
the resulting documents must satisfy both of the search
criteria. The Boolean operator OR tends to enlarge the
retrieval set, as a document is included in the retrieval
set if it satisfies either of the search criteria.
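The effect of each operator on the size of the retrieval set can be sketched with ordinary set operations. The document ids below are hypothetical; each set stands for the documents retrieved by a single-word search.

```python
# Hypothetical retrieval sets: ids of documents matching each term.
scanner = {1, 2, 3, 4, 5, 6}
color = {2, 4, 7}

both = scanner & color       # scanner AND color
either = scanner | color     # scanner OR color
without = scanner - color    # scanner AND NOT color

print(sorted(both), sorted(either), sorted(without))
```

Here AND and AND NOT each return fewer documents than the "scanner" search alone, while OR returns more.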

5. Wildcard Operators 

Wildcard operators can be single or multiple
character wildcards. Single character operators take the
place of one character; multiple character operators take
the place of many characters. Wildcard operators are
particularly useful due to the varying sophistication of
automatic indexing programs. The procedures used by
automatic indexing programs for handling word suffixes
vary. For example, some automatic indexing software indexes
only the roots of words, whereas other software does not.
This means that a word search with "farm", conducted on a
database indexed with a package that indexes only the roots
of words, will also retrieve documents containing farms,
farmer, farmed, and farming. For those packages that do not
index only word roots, similar results could be obtained
using a multiple wildcard operator. A search for "farm?"
(where ? is a multiple wildcard operator) would retrieve
documents containing terms with "farm" as the first four
letters, such as farm, farms, farmer, farmed, farming,
farmyard, farmhouse, farmstead, etc.

6 A set is a group of documents previously retrieved,
meeting the criteria of a specified search.
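The "farm?" example can be sketched by translating the pattern into a regular expression. This assumes, as in the text, that ? is the multiple character wildcard and that it may stand for zero or more characters; the function name and the vocabulary list are hypothetical.

```python
import re


def wildcard_to_regex(pattern, multi="?"):
    """Compile a pattern in which `multi` stands for zero or more characters."""
    parts = (re.escape(part) for part in pattern.split(multi))
    return re.compile("^" + ".*".join(parts) + "$")


# Hypothetical index vocabulary.
vocabulary = ["farm", "farms", "farmer", "farming", "farmhouse",
              "firm", "alfalfa"]

matcher = wildcard_to_regex("farm?")
matches = [word for word in vocabulary if matcher.match(word)]
print(matches)
```

Every term beginning with "farm" is matched, including "farmhouse", which a root-only index would not necessarily group with "farm".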

F. BROWSING'S ROLE IN DOCUMENT RETRIEVAL 

Browsing is vitally important to a searcher, as it
allows him/her to read through information as one would
read a book. Since this is second nature, it provides a
familiar way to find information or validate information
found through searching. Browsing can be thought of as a
method of document retrieval. Browsing deals with moving
from "where" it is to "what" is there. A user may pick a
location in the database (a document or point in a
document), and then look (browse) to see what is there.
Searching deals with traversing from "what" the user wants
to "where" it is, meaning that the system is told "what to
look for", and then provides the user with "documents where
it is located." (Zoellick, 1987, p. 65)

Browsing is particularly useful with document databases
located on CD-ROM. Prior to the availability of CD-ROM
document databases, searching was conducted on-line. Slow
data transfer rates and high communication costs meant the
searcher did not have the time, or money, to freely browse
retrieval sets or the database. A searcher may now connect
to a CD-ROM document database via his/her personal computer.
This provides the searcher with fast, easy access for
unlimited browsing, as well as searching, and thus the full
use of the complementary search and browse features of full
text document retrieval. (Zoellick, 1987, p. 65)






IV. DATABASE DEVELOPMENT 



A. INTRODUCTION 

Documents existing in paper form can be converted into
digital form as bit mapped images, or as a combination of
ASCII text and bit mapped images. Each form provides
advantages and disadvantages in the conversion process, as
well as in the electronic retrieval process. To compare the
advantages and disadvantages for each of these processes,
and to determine the best overall form, two databases were
developed. One database contains primarily text, with few
images; the other contains primarily images, with little
text.

The equipment used to develop these databases consisted 
of a Zenith 286 personal computer operating at 12 MHz. The
computer contains 640 Kilobytes of RAM, two 20 Megabyte 
Winchester Drives, and a 4 Megabyte Kofax 8202 image memory 
board, for image compression and decompression when using 
the attached scanner. Image scanning was performed on a 
Fujitsu model M3094E flatbed scanner with automatic sheet 
feeder. All documents were scanned at a 300 DPI resolution 
with supporting software for image processing and text 
recognition.7 A Worm Drive by Laser Drive Limited, model

7 Calera TrueScan Version 1.071 from Calera 
Recognition Systems, and Irecognize Profession Plus - Page 
Image Processing Software for Image Capture and Text 